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Abstract 

Background:  A  low  genetic  diversity  in  Francisella  tularensis  has  been  documented.  Current  DNA  based 
genotyping  methods  for  typing  F.  tularensis  offer  a  limited  and  varying  degree  of  subspecies,  clade  and  strain  level 
discrimination  power.  Whole  genome  sequencing  is  the  most  accurate  and  reliable  method  to  identify,  type  and 
determine  phylogenetic  relationships  among  strains  of  a  species.  However,  lower  cost  typing  schemes  are 
necessary  in  order  to  enable  typing  of  hundreds  or  even  thousands  of  isolates. 

Results:  We  have  generated  a  high-resolution  phylogenetic  tree  from  40  Francisella  isolates,  including  1 3  F. 
tularensis  subspecies  holarctica  (type  B)  strains,  26  F.  tularensis  subsp.  tularensis  (type  A)  strains  and  a  single  F. 
novicida  strain.  The  tree  was  generated  from  global  multi-strain  single  nucleotide  polymorphism  (SNP)  data 
collected  using  a  set  of  six  Affymetrix  GeneChip®  resequencing  arrays  with  the  non-repetitive  portion  of  LVS 
(type  B)  as  the  reference  sequence  complemented  with  unique  sequences  of  SCHU  S4  (type  A).  Global  SNP  based 
phylogenetic  clustering  was  able  to  resolve  all  non-related  strains.  The  phylogenetic  tree  was  used  to  guide  the 
selection  of  informative  SNPs  specific  to  major  nodes  in  the  tree  for  development  of  a  genotyping  assay  for 
identification  of  F.  tularensis  subspecies  and  clades.  We  designed  and  validated  an  assay  that  uses  these  SNPs  to 
accurately  genotype  39  additional  F.  tularensis  strains  as  type  A  (A  I ,  A2,  A I  a  or  A I  b)  or  type  B  (B I  or  B2). 

Conclusion:  Whole-genome  SNP  based  clustering  was  shown  to  accurately  identify  SNPs  for  differentiation  of 
F.  tularensis  subspecies  and  clades,  emphasizing  the  potential  power  and  utility  of  this  methodology  for  selecting 
SNPs  for  typing  of  F.  tularensis  to  the  strain  level.  Additionally,  whole  genome  sequence  based  SNP  information 
gained  from  a  representative  population  of  strains  may  be  used  to  perform  evolutionary  or  phylogenetic 
comparisons  of  strains,  or  selection  of  unique  strains  for  whole-genome  sequencing  projects. 
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Background 

Francisella  tularensis,  a  Gram-negative  bacterium,  is  the 
causative  agent  of  tularemia  and  a  Category  A  select  agent. 
F.  tularensis  is  divided  into  three  subspecies  (subsp.):  tula¬ 
rensis  (type  A);  holarctica  (type  B);  and  mediasiatica  [1,2]. 
Tularemia  caused  by  type  A  strains  occurs  only  in  North 
America,  whereas  tularemia  caused  by  type  B  strains 
occurs  throughout  the  northern  hemisphere.  Together 
these  two  species  account  for  the  majority  of  cases  of 
tularemia  worldwide.  P.  tularensis  subsp.  mediasiatica 
includes  strains  predominant  in  central  Asia  [3],  P.  novic- 
ida  has  been  suggested  to  be  a  subspecies  of  F.  tularensis 
based  on  genetic  similarity  [4,5],  but  is  still  formally  rec¬ 
ognized  as  a  distinct  species.  F.  novicida  has  been  isolated 
from  North  America  and  Australia,  and  rarely  causes 
human  disease  even  though  it  can  cause  a  lethal  infection 
in  the  murine  model  of  disease  [3,6]. 

Current  DNA  based  genotyping  methods  for  typing  F. 
tularensis  offer  a  varying  degree  of  power  to  discriminate 
subspecies,  clades  and  strains  [2,7,8].  Two  clades,  A1  and 
A2,  within  F.  tularensis  subsp.  tularensis  have  been 
reported  based  on  multiple  subtyping  methods  including 
multi-locus  variable  number  tandem  repeat  analysis 
(MLVA),  pulsed-field  gel  electrophoresis  (PFGE)  and 
multi-locus  sequence  typing  [2],  Clades  within  Al,  Ala 
and  Alb,  have  been  identified  by  PFGE  [9].  A  limited 
degree  of  variation  has  been  observed  within  type  B 
strains  by  all  methods.  MLVA  currently  provides  the  high¬ 
est  degree  of  strain  discrimination  for  F.  tularensis,  how¬ 
ever  it  is  limited  in  its  ability  to  perform  evolutionary 
analyses  and  to  estimate  relationships  among  very  closely 
related  strains  [10]. 

Development  of  high-resolution  genotyping  methods  for 
F.  tularensis  can  ideally  be  met  by  whole  genome  sequenc¬ 
ing  of  multiple  strains.  Whole  genome  sequencing  is  the 
most  accurate  and  reliable  method  to  identify  and  dis¬ 
criminate  strains  of  a  species,  especially  those  species  with 
a  high  degree  of  genome  homogeneity.  Genomic 
sequence  information  of  several  type  A  and  B  strains  is 
now  available  http://www.ncbi.nlm.nih.gov/sites/ent 

rez?db=genomepri&orig  db=<Siterm=Fran _ cis- 

ella%20tularensis&cmd=Search.  F.  tularensis  has  a  single 
circular  chromosome  with  genome  size  of  ~1.89  Mb.  Nat¬ 
urally  occurring  plasmids  have  not  been  reported  for  F. 
tularensis  strains  so  far.  A  low  genetic  diversity  in  F.  tularen¬ 
sis  has  been  documented.  Based  on  whole  genome 
sequencing,  the  genetic  variation  between  the  type  B  live 
vaccine  strain  (LVS)  and  two  other  type  B  strains,  FSC200 
and  OSU18,  is  only  0.08%  and  0.11%  respectively.  F.  tula¬ 
rensis  subsp.  holarctica  strain  FSC200  is  a  virulent  strain  of 
European  origin  whereas  F.  tularensis  subsp.  holarctica 
strain  OSU18  is  a  virulent  strain  isolated  in  the  United 
States.  A  higher  genetic  variation  of  0.7%  has  been 


reported  between  a  type  B  (LVS)  and  type  A  (SCFIU  S4) 
strain  [11],  Global  single  nucleotide  polymorphism 
(SNP)  information,  based  on  whole  genome  sequencing, 
offers  several  advantages  over  existing  typing  methods 
because  each  individual  nucleotide  may  be  a  useful 
genetic  character.  The  cumulative  differences  in  two  or 
more  sequences  provide  a  larger  number  of  discriminators 
that  can  be  used  to  genotype  and  distinguish  bacterial 
strains.  Strain  genotypes  that  are  built  upon  SNP  variation 
are  highly  amenable  to  evolutionary  reconstruction  and 
can  be  readily  analyzed  in  a  phylogenetic  and  population 
genetic  context  to:  i)  assign  unknown  strains  into  well- 
characterized  clusters;  ii)  reveal  closely  related  siblings  of 
a  particular  strain;  and  iii)  examine  the  prevalence  of  a 
specific  allele  in  a  population  of  closely  related  strains  that 
may  in  turn  correlate  with  phenotypic  features  of  the 
infectious  agent  [12],  SNPs  also  provide  potential  markers 
for  the  purpose  of  strain  identification  important  for 
forensic  and  epidemiological  investigations. 

Previously,  we  reported  an  Affymetrix  GeneChip®  based 
approach  for  whole  genome  F.  tularensis  resequencing  and 
global  SNP  determination  [13].  We  now  report  the  whole 
genome  sequence  and  global  SNP  data  from  40  Prancisella 
strains  using  this  approach.  Twenty  six  F.  tularensis  type  A 
(20  Al  and  6  A2),  thirteen  F,  tularensis  type  B  and  one  F. 
novicida  strain  were  used  for  phylogenetic  SNP  analysis 
and  identification  of  high-quality  SNPs  for  use  as  typing 
markers.  Based  on  our  global  analysis  of  40  genomes,  we 
were  able  to  identify  a  series  of  SNPs  at  various  levels  of 
hierarchy.  We  used  these  SNPs  to  develop  and  validate  a 
low-cost  PCR-based  assay  for  typing  and  discriminating  F. 
tularensis  isolates. 

Methods 
Francisella  strains 

Prancisella  strains  used  for  whole  genome  sequencing  are 
listed  in  Table  1.  Strains  used  for  evaluation  of  diagnostic 
SNP  markers  are  shown  in  Table  2.  All  strains  were  identi¬ 
fied  as  either  type  A  or  type  B  by  glycerol  fermentation  or 
PGR.  Pulsed  field  gel  electrophoresis  using  Pmel  was  per¬ 
formed  for  GDC  strains  to  characterize  type  A  strains  as 
either  Al,  A2,  Ala  or  Alb  [14],  Ribotyping,  using  the 
Dupont  Qualicon  RiboPrinter  and  Pvull  restriction 
enzyme,  was  used  to  characterize  USAMRIID  type  A 
strains  as  Al  or  A2  (USAMRIID,  unpublished  method). 

Francisella  genomic  DNA 

Genomic  DNAs  of  F.  tularensis  reference  strains  LVS  and 
SCHU  S4  were  obtained  from  Dr.  Luther  Lindler  of  Global 
Emerging  Infections  Surveillance  and  Response  System  of 
Department  of  Defense.  Genomic  DNA  was  isolated  from 
the  strains  in  Table  1  and  Table  2  using  tbe  QlAamp  DNA 
mini  kit  or  Centra  Puregene  Cell  Kit  (Qiagen,  Valencia, 
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Table  I:  Francisella  strains  resequenced  in  the  study 


S.  No. 

Isolate 

Species/Subspecies 

Clade^ 

Other  strain  name 

Geographic  Source 

Year  isolated 

Source 

1 

SCHUS4 

F.  tularensis  type  A 

Al  (Ala) 

Ohio 

1941 

CDC 

2 

MAOO-2987 

F.  tularensis  type  A 

Al  (Alb) 

Massachusetts 

2000 

CDC 

3 

AROI-I  1  17 

F.  tularensis  type  A 

Al  (Alb) 

Arkansas 

2001 

CDC 

4 

KS00-I8I7 

F.  tularensis  type  A 

Al  (Ala) 

Kansas 

2000 

CDC 

5 

OKOO-2732 

F.  tularensis  type  A 

Al  (Alb) 

Oklahoma 

2000 

CDC 

6 

FRAN005 

F.  tularensis  type  A 

Al 

Illinois 

1990 

USAMRIID 

7 

FRAN006 

F.  tularensis  type  A 

Al 

Illinois 

1988 

USAMRIID 

8 

FRAN007 

F.  tularensis  type  A 

Al 

Illinois 

1988 

USAMRIID 

9 

FRAN008 

F.  tularensis  type  A 

Al 

Illinois 

1988 

USAMRIID 

10 

FRAN009 

F.  tularensis  type  A 

Al 

Illinois 

1988 

USAMRIID 

1  1 

FRANOlO 

F.  tularensis  type  A 

Al 

Illinois 

1987 

USAMRIID 

12 

FRANOI  11= 

F.  tularensis  type  A 

Al 

Illinois 

1984 

USAMRIID 

13 

FRAN0I4 

F.  tularensis  type  A 

Al 

Illinois 

1989 

USAMRIID 

14 

FRANOI  5 

F.  tularensis  type  A 

Al 

Illinois 

1988 

USAMRIID 

IS 

FRAN023 

F.  tularensis  type  A 

Al 

FoxPI 

Ohio 

1940 

USAMRIID 

16 

FRAN026 

F.  tularensis  type  A 

Al 

Schu-SOO 

Unknown 

Unknown 

USAMRIID 

17 

FRAN030 

F.  tularensis  type  A 

Al 

SOL 

Unknown 

Unknown 

USAMRIID 

18 

FRAN03  1 

F.  tularensis  type  A 

Al 

SCHERM 

Ohio 

1944 

USAMRIID 

19 

FRAN032 

F.  tularensis  type  A 

Al 

GREU 

Ohio 

Unknown 

USAMRIID 

20 

FRAN033 

F.  tularensis  type  A 

Al 

HUGH 

Ohio 

1940 

USAMRIID 

21 

WY96-34I8 

F.  tularensis  type  A 

A2 

Wyoming 

1996 

CDC 

22 

CA02-0099 

F.  tularensis  type  A 

A2 

California 

2002 

CDC 

23 

UT02-I927 

F.  tularensis  type  A 

A2 

Utah 

2002 

CDC 

24 

FRANOOl 

F.  tularensis  type  A 

A2 

38  derivative  (ATCC  6223) 

Utah 

1920  (?) 

USAMRIID 

25 

FRAN027 

F.  tularensis  type  A 

A2 

38A  (38  derivative) 

Utah 

- 

USAMRIID 

26 

FRAN028 

F.  tularensis  type  A 

A2 

Larsen  NIH38  (38  derivative) 

Utah 

- 

USAMRIID 

27 

LVS 

F.  tularensis  type  B 

Russia 

1958  (?) 

CDC 

28 

KY99-3387 

F.  tularensis  type  B 

Kentucky 

1999 

CDC 
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Table  I:  Francisella  strains  resequenced  in  the  study  (Continued) 


29 

OR96-0246 

F.  tuiarensis  type  B 

Oregon 

1996 

CDC 

30 

OR96-0463 

F.  tuiarensis  type  B 

Oregon 

1996 

CDC 

31 

KYOO-1708 

F.  tuiarensis  type  B 

Kentucky 

2000 

CDC 

32 

MOOI-1673 

F.  tuiarensis  type  B 

Missouri 

2001 

CDC 

33 

INOO-2758 

F.  tuiarensis  type  B 

Indiana 

2000 

CDC 

34 

CA99-3992 

F.  tuiarensis  type  B 

California 

1999 

CDC 

35 

FRAN004 

F.  tuiarensis  type  B 

LVS 

Russia 

1958  (?) 

USAMRIID 

36 

FRAN0I2 

F.  tuiarensis  type  B 

Alabama 

1991 

USAMRIID 

37 

FRAN024 

F.  tuiarensis  type  B 

JAP 

Japan 

1926 

USAMRIID 

38 

FRAN025 

F.  tuiarensis  type  B 

VT68 

Vermont 

1968 

USAMRIID 

39 

FRAN029 

F.  tuiarensis  type  B 

425 

Montana 

1941  (?) 

USAMRIID 

40 

FRAN003 

F.  novicida 

ATCC  15482  (Ul  12) 

Utah 

1950 

USAMRIID 

“Strains  characterized  to  the  level  of  Ala  or  Alb  by  Pmel  PFGE  are  indicated. 
'’Isolate  recovered  from  a  clinically  normal  rabbit 


CA)  according  to  the  manufacturer's  instructions. 
Genomic  DNA  samples  were  stored  at  -80  °C. 

F.  tuiarensis  custom  resequencing  array  set 

The  basis  of  the  Affymetrix  GeneChip'®  resequencing  by 
hybridization  and  the  details  of  the  design  of  F.  tuiarensis 
GeneChip®  set  has  been  described  earlier  [13],  Briefly,  the 
design  is  primarily  on  the  basis  of  the  DNA  sequence  of 
strain  TVS  (GenBank  Accession:  AM  2.33.362)  serving  as  a 
reference  and  complemented  with  unique  sequences  of 
SCHU  S4  (GenBank  Accession:  Al  749949).  A  total  of 
1,764,558  queryable  bases  were  identified  for  resequenc¬ 
ing  by  hybridization  after  exclusion  of  ~9.22%  of  repeti¬ 
tive  sequence  from  the  design.  This  sequence  was  tiled 
onto  a  set  of  six  CustomSeq  300  K  GeneChips®  by  Affyme¬ 
trix,  Inc.,  (Santa  Clara,  CA).  This  design  provides  approxi¬ 
mately  91%  of  the  F.  tuiarensis  double  stranded  genome 
sequence  information  from  holarctica  (type  B)  and  tuiaren¬ 
sis  (type  A)  subspecies.  The  whole  genome  resequencing 
was  performed  in  duplicate  for  all  query  strains. 

Whole  genome  amplification,  resequencing  assay  and  raw 
data  acquisition 

Francisella  genomic  DNA  amplification,  DNA  fragmenta¬ 
tion,  labeling,  hybridization  and  acquisition  of  raw  data 
was  carried  out  exactly  as  described  earlier  [13]. 


Processing  of  raw  data  with  bioinformatic  filters 

Hybridization  of  a  whole-genome  sample  on  an  Affyme¬ 
trix®  resequencing  array  platform  can  lead  to  incorrect 
basecalls  due  to  a  number  of  systematic  effects  that  are  less 
prevalent  when  the  sample  consists  of  a  purified  PGR 
product.  We  have  developed  bioinformatic  filters  to 
account  for  most  of  these  predictable  adverse  effects.  Our 
bioinformatic  filters  consist  of  a  set  of  Perl  scripts  that 
operate  on  the  GHP  files  generated  by  GSEQ  software  and 
produce  a  list  of  high-confidence  SNP  calls  from  the  larger 
raw  set  of  SNPs  calls  present  in  those  files.  The  scripts  are 
available  for  download  from  our  website  http:// 
pfgrc.icvi.org/index.php/compare  genomics/ 
snp  scripts.html.  Each  filter  serves  to  reduce  the  number 
of  candidate  SNPs.  The  output  of  one  fdtering  step 
becomes  the  input  for  the  next.  The  detailed  descriptions 
of  these  filters  have  been  reported  [13]. 

Briefly,  the  quality  filter  implemented  in  GSEQ  software 
initially  eliminates  SNP  calls  that  have  been  assigned  low 
quality  scores  based  on  the  difference  in  signal  intensity 
between  the  highest  intensity  probe  pair  and  the  next 
highest  intensity  pair  at  a  particular  locus.  The  first  filter 
applied  is  the  "low  homology  filter"  which  identified 
regions  that  performed  poorly  as  a  result  of  deletions  in 
the  sample  relative  to  the  reference  sequence.  The  base 
calls  from  the  GHP  files  from  GSEQ  software  are  scanned 
to  identify  regions  of  adjacent  positions  that  are  rich  in 
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Table  2:  F.  tularensis  strains  used  to  evaluate  SNP  diagnostic  markers 


S.  No. 

Isolate 

Subspecies 

Clade 

Geographic  Source 

Year  isolated 

1 

ND00-09S2 

type  A 

Al  (Ala) 

North  Dakota 

2000 

2 

MOOI-1907 

type  A 

Al  (Ala) 

Missouri 

2001 

3 

AROO-0028 

type  A 

Al  (Ala) 

Arkansas 

2000 

4 

KSOO-0948 

type  A 

Al  (Ala) 

Kansas 

2000 

5 

OKO 1-2528 

type  A 

Al  (Ala) 

Oklahoma 

2001 

6 

CAOO-0036 

type  A 

Al  (Ala) 

California 

2000 

7 

AR98-2I46 

type  A 

Al  (Ala) 

Arkansas 

1998 

8 

GA02-S497 

type  A 

Al  (Ala) 

Virginia 

1982 

9 

NCO 1-5379 

type  A 

Al  (Alb) 

North  Carolina 

2001 

10 

NY04-2787 

type  A 

Al  (Alb) 

New  York 

2004 

1  1 

AK96-2888 

type  A 

Al  (Alb) 

Alaska 

1996 

12 

OK02-0I95 

type  A 

Al  (Alb) 

Oklahoma 

2002 

13 

PA04-2790 

type  A 

Al  (Alb) 

Pennsylvania 

2004 

14 

CA04-2258 

type  A 

Al  (Alb) 

California 

2004 

IS 

GA02-S375 

type  A 

Al  (Alb) 

New  York 

1977 

16 

WY03-I228 

type  A 

A2 

Wyoming 

2003 

17 

COO  1-37 13 

type  A 

A2 

Colorado 

2001 

18 

UT07-4362 

type  A 

A2 

Utah 

2007 

19 

TX00-I59I 

type  A 

A2 

Texas 

2000 

20 

GA02-5453 

type  A 

A2 

Wyoming 

1993 

21 

WYO 1-391  1 

type  A 

A2 

Wyoming 

2001 

22 

NM99-029S 

type  A 

A2 

New  Mexico 

1999 

23 

ID04-2687 

type  A 

A2 

Oregon 

2004 

24 

AZOO-I  180 

type  B 

Arizona 

2000 

25 

AZOO-1324 

type  B 

Arizona 

2000 

26 

SP03-I782 

type  B 

Spain 

2003 

27 

WA98-I774 

type  B 

Washington 

1998 

28 

E3443 

type  B 

Oregon 

1978 
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Table  2:  F.  tularensis  strains  used  to  evaluate  SNP  diagnostic  markers  (Continued) 


29 

SP98-2I08 

type  B 

Spain 

1998 

30 

OR98-07I9 

type  B 

Oregon 

1998 

31 

RCS03 

type  B 

Russia 

- 

32 

SP03-I783 

type  B 

Spain 

2003 

33 

CN98-5979 

type  B 

Canada 

1998 

34 

NY98-2295 

type  B 

New  York 

1998 

35 

TX03-3834 

type  B 

Mississippi 

2003 

36 

IN00-27S8 

type  B 

Indiana 

2000 

37 

F48S3 

type  B 

California 

1983 

38 

OHO  1-3029 

type  B 

Kansas 

2001 

39 

CO05-3922 

type  B 

Colorado 

2005 

no-calls  and  SNP  calls.  SNP  calls  that  occur  within  the 
defined  low  homology  region  are  removed  from  the  list  of 
high-confidence  SNP  calls.  The  next  script  is  referred  to  as 
the  alternate  homology  filter.  The  alternate  homology 
effect  is  caused  by  the  sequences  in  the  query  DNA  sample 
capable  of  hybridizing  with  high  efficiency  to  more  than 
one  probe  pair  at  a  locus  on  the  array.  When  a  locus  con¬ 
tains  two  strongly  hybridizing  probe  pairs,  the  GSEQ  soft¬ 
ware  may  make  a  SNP  call,  a  reference  base  call  or  a  no¬ 
call  ("N"),  depending  on  the  relative  signal  strengths  of 
the  probe  pairs.  The  alternate  homology  filter  identifies 
SNP  calls  that  may  have  arisen  as  a  result  of  this  effect 
based  on  the  difference  in  binding  energy  between  the 
alternate  (SNP)  sequence  and  the  reference  sequence.  If 
the  difference  between  these  two  binding  energies  is  = 
11.5  kcal/mol,  the  SNP  call  is  assumed  to  be  an  artifact  of 
the  alternate  sequence  homology,  and  it  is  removed  from 
the  list  of  high  confidence  SNP  calls.  The  remaining  SNP 
calls  are  then  put  through  the  footprint  effect  filter.  The 
artifact  called  the  footprint  effect  is  caused  by  the  occur¬ 
rence  of  a  real  SNP  in  a  query  sample  that  results  in  a 
destabilizing  effect  on  25-mers  in  the  immediate  vicinity 
of  the  SNP.  The  footprint  effect  filter  algorithm  assumes 
that  a  genuine  SNP  is  most  likely  to  cause  spurious  SNP 
calls  at  locations  within  10  bases  on  either  side  of  the  gen¬ 
uine  SNP.  Any  SNP  call  that  occurs  more  than  10  base 
positions  from  the  nearest  neighboring  SNP  call  is 
assumed  to  be  valid,  and  any  SNP  call  that  has  one  or 
more  neighbors  within  10  base  positions  is  subjected  to 
the  filter.  Since  any  number  of  consecutive  SNP  calls 
within  10  base  positions  of  each  other  may  occur  in  the 
data,  this  filter  is  implemented  as  a  recursive  algorithm. 


For  each  list  of  consecutive  SNP  calls  that  each  lies  within 
10  bases  of  its  neighbors,  the  algorithm  identifies  the  SNP 
call  having  the  highest  quality  score.  That  SNP  call  is 
accepted  as  valid,  and  its  immediate  neighbors  are 
removed  from  the  list  of  high  confidence  SNP  calls.  This 
action  may  break  the  original  list  of  neighboring  SNP  calls 
into  two  separate  lists.  All  resulting  lists  are  processed 
recursively  in  the  same  way,  until  all  of  the  SNP  calls  have 
been  accepted  or  rejected.  This  algorithm  is  implemented 
in  the  RemoveFootprintEffect.pl  Perl  program.  All  the 
above  filters  are  applied  to  individual  data  sets  generated 
for  any  sample,  following  which  a  final  filter  referred  to  as 
the  replicate  combination  filter  is  applied.  The  replicate 
combination  filter  generates  the  list  of  common  SNPs 
present  in  both  the  experiments. 

Phylogenetic  clustering,  selection  of  SNP  markers  and  PCR 
primer  design  from  multistrain  global  Francisella  SNP 
collection 

We  generated  a  phylogenetic  tree  from  the  resequencing 
data  by  considering  only  those  locations  at  which  a  SNP 
occurred  in  one  or  more  of  the  forty  strains.  For  each 
strain,  we  constmeted  a  sequence  containing  the  base  calls 
at  each  of  the  locations  at  which  a  SNP  was  found  in  some 
strain(s).  This  resulted  in  forty  sequences,  each  containing 
19,897  base  calls  (including  no-calls)  which  were  used  for 
the  phylogenetic  analysis.  The  phylogenetic  tree  was  gen¬ 
erated  using  the  MrBayes  program,  version  3.1.2  [15-17], 
The  program  was  mn  for  200,000  generations,  using  a 
haploid  model.  The  root  of  the  resulting  tree  was  inferred 
by  midpoint  rooting.  The  resulting  tree  is  reported  as  a 
cladogram  and  as  a  phylogram.  A  phylogenetic  tree  (Addi- 
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tional  File  1)  was  also  generated  from  the  same  data  using 
the  dnaml  (maximum  likelihood)  program  of  the  PHYLIP 
package  version  3.6  [18], 

Node  pairings  which  discriminated  between  subspecies  or 
clades  were  selected  for  the  development  of  diagnostic 
typing  assays.  Criteria  used  to  select  SNP  locations  for  the 
assay  were: 

1 .  The  SNP  location  must  cleanly  differentiate  the  two 
nodes  of  interest.  Within  each  of  the  nodes,  all  of  the 
member  strains  must  share  the  same  base  call  at  the 
location,  and  the  two  nodes  must  differ  at  the  loca¬ 
tion. 

2.  The  sequences  downstream  of  the  SNP  location 
must  be  in  sufficient  agreement  among  all  strains  from 
both  nodes  so  that  an  appropriate  primer  can  be  cho¬ 
sen  from  the  consensus  sequence  (the  consensus  at  the 
primer  location  may  not  contain  "N"  calls  or  any  con¬ 
flicting  base  calls). 

3.  The  primer  sequences  must  have  melting  tempera¬ 
tures  within  a  specific  limited  range  (60°C  to  70°C). 

4.  The  predicted  PCR  product  size  must  be  within  the 
range  150  to  500  bp. 

We  developed  a  set  of  programs  to  identify  candidate  SNP 
locations  for  the  real-time  PCR  (RT-PCR)  assay.  SNPTree 
uses  the  phylogenetic  tree  and  the  multi-FASTA  files  from 
the  resequencing  experiments  as  input,  assigns  arbitrary 
node  numbers  to  all  nodes  in  the  tree,  and  produces  a  set 
of  multi-FASTA  files,  one  for  each  node  in  the  tree,  of  the 
consensus  base  calls  for  each  node.  The  consensus  call  is 
"N"  unless  all  members  of  a  particular  node  share  the 
same  base  call  at  that  location.  The  program  also  produces 
a  set  of  files,  one  for  each  node,  listing  the  base  calls  that 
occur  at  every  SNP  location,  for  all  SNP  positions  detected 
within  the  entire  set  of  40  samples  (19,897  locations).  The 
program  CompareNodes  uses  the  SNP  list  files  for  any 
two  nodes  and  produces  a  list  of  SNP  locations  that 
cleanly  differentiate  the  two  nodes  (described  above).  The 
program  CreatePrimer3  uses  a  list  of  discriminating  SNP 
locations  and  the  multi-FASTA  files  for  two  nodes,  and 
creates  an  input  file  for  the  Primer3  program  [19], 
CreatePrimer3  also  chooses  the  5 '-forward  primers,  which 
are  constrained  by  the  locations  of  the  SNPs.  The  Primer3 
software  [19]  is  then  used  to  identify  appropriate  3'- 
reverse  primers.  The  Primer3  program  enforces  the  last 
three  criteria  listed  above.  This  process  resulted  in  the 
design  of  a  large  number  of  primers  for  candidate  SNP 
locations  for  most  node  pairs  that  may  be  used  as  diag¬ 
nostic  markers.  The  final  set  of  SNP  markers/locations  we 
used  was  selected  manually  by  identifying  primers  distrib¬ 


uted  over  the  entire  genome.  The  programs  SNPTree, 
CompareNodes  and  CreatePrimer3  were  developed  at  the 
J.  Craig  Venter  Institute  specifically  for  this  study  and  are 
freely  available  for  download  ftp://ftp.icvi.org/pub/soft 
ware/pfgrc/SNPTree/SNPTreePackage.tar.gz.  These  pro¬ 
grams  along  with  our  bioinformatic  filter  pipeline  can  eas¬ 
ily  be  adapted  for  other  bacterial  model  systems  for  whole 
genome  resequencing  and  SNP  phylogeny  using  the 
Affymetrix  resequencing  array  platform. 

Primer3  software  was  used  to  design  discriminating  PCR 
primers  based  on  the  set  of  discriminating  locations  iden¬ 
tified.  Three  primers  were  designed  at  each  discriminating 
location:  a  5  '-forward  primer  with  the  node  X  call  in  the  3 ' 
position;  a  5 '-forward  primer  with  the  node  Y  call  in  the  3' 
position;  and  a  single  3 '-reverse  primer.  A  base  call  at  the 
discriminating  location  is  determined  by  two  PCR  reac¬ 
tions  where  one  of  the  two  yields  a  lower  cycle  threshold 
(Ct)  value.  The  RT-PCR  primers  used  are  shown  in  Addi¬ 
tional  File  2. 

Real-time  PCR  assays  for  F.  tularensis  typing 

Real-time  PCR  assays  to  identify  F.  tularensis  subspecies 
and  clades  were  developed  using  SYBR®  Green  (BioRad, 
Flercules  CA)  which  binds  all  dsDNA  molecules,  emitting 
a  fluorescent  signal  of  a  defined  wavelength  (522  nm). 
Reactions  were  performed  in  20  pi  volume  and  contained 
80  pg  of  genomic  DNA  (0.01  ng/pl),  150  nM  of  forward 
and  reverse  primers  and  10  pi  of  iQ  SYBR®  Green  Super¬ 
mix  (BioRad,  Hercules  CA).  Reaction  components  were 
mixed  in  a  V-bottom  thin  wall  PCR  96-well  plate  (BioRad, 
Hercules  CA).  Real-time  PCR  was  performed  using  the 
iCycler  iQ  (BioRad,  Hercules,  CA)  with  the  following  ther¬ 
mal  cycling  parameters:  50 '"C  for  2  min,  95  °C  for  5  min, 
60  cycles  of  95 '’C  for  15  seconds  and  68°C  for  30  sec¬ 
onds,  72 °C  for  30  seconds,  95 '’C  for  1  min  and  finally 
55  °C  for  3  min.  The  fluorescence  was  measured  at  72'"C 
in  the  cycle  program.  A  cycle  threshold  (Ct)  was  automat¬ 
ically  generated  by  the  iCycler  iQ  Version  3.0a  analysis 
software  for  each  amplification  reaction  (BioRad,  Her¬ 
cules  CA).  Melt  curve  analysis  was  performed  to  verify  that 
no  primer  dimers  formed. 

Results 

Whole  genome  resequencing  of  strains 

Previously,  we  reported  an  Affymetrix  Inc.  GeneChip® 
array  based  whole  genome  resequencing  platform  for  F. 
tularensis.  Our  whole-genome  sequencing  by  hybridiza¬ 
tion  approach  made  use  of  a  set  of  bioinformatic  filters  to 
eliminate  a  majority  of  false  positives  and  indicated  a  base 
call  accuracy  of  99.999%  (Phred  equivalent  score  50)  for 
type  B  strain  LVS  [13].  The  base  call  accuracy  was  deter¬ 
mined  by  comparing  the  base  calls  remaining  after  the 
application  of  our  filters  to  the  published  sequence  of  the 
LVS  strain.  The  bioinformatic  filter  programs  may  be 
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accessed  at  http://pfgrc.icvi.org/index.php/ 

compare  genomics/snp  scripts.html.  Two  type  A  strains, 
WY96  3418  and  SCHU  S4  showed  base  call  accuracies  of 
99.995%  and  99.992%  with  Phred  equivalent  scores  of  43 
and  41  respectively  [13].  We  used  this  approach  to  collect 
whole-genome  sequence  and  global  SNP  information 
from  40  Francisella  strains.  Table  1  shows  the  list  of  strains 
analyzed  in  this  study.  Twenty  six  type  A  (20  A1  and  6  A2), 
thirteen  type  B  and  one  F.  novicida  strain  were  rese¬ 
quenced. 

The  base  call  rate  and  number  of  SNPs  for  F.  tularensis  Al, 
A2  and  type  B  strains  are  shown  in  Figure  1  and  Addi¬ 
tional  File  3.  The  base  call  rate  for  all  forty  strains  was  in 
the  range  of  83.04%  to  97.92%.  This  range  improved  to 
92.43%  -  97.92%  when  the  F.  novicida  strain  FRAN003 
(base  call  rate  of  83.041%  and  total  SNPs  12407)  was 
excluded.  The  whole  genome  resequencing  call  rate  was  in 
the  range  of  94.62%  to  97.62%  for  Al  strains,  92.43%  to 
97.41%  for  A2  strains  and  94.04%  to  97.92%  for  type  B 
strains.  Overall,  type  B  strains  displayed  the  highest  aver¬ 
age  base  call  rate  of  95.97%  +  1.06%  and  A2  displayed  the 
lowest  with  94.40%  +  0.64%.  The  average  base  call  rate  for 
Al  strains  was  95.87%  +  0.64%.  The  total  number  of  SNPs 
for  all  forty  strains  ranged  widely  from  15  to  12,407.  As 
expected  FRAN003,  the  F.  novicida  strain,  displayed  the 
highest  number  of  SNPs  (12,407)  compared  to  the  F.  tula¬ 
rensis  reference  (TVS  +  SCHU  S4)  sequence.  The  wide 
range  in  SNP  differences  was  reduced  almost  by  half,  15 
to  6543,  when  the  F.  novicida  sequence  was  excluded. 

F.  tularensis  type  B  strains  displayed  the  lowest  number  of 
SNPs,  ranging  from  15  to  2915.  As  expected,  TVS  strains 
(TVS  and  FRAN004)  showed  the  fewest  SNP  positions 
(15-16)  when  compared  to  the  reference  sequence.  The 
genomes  of  all  other  type  B  strains,  except  for  FRAN024, 
contained  497  -  605  SNPs,  when  compared  to  the  refer¬ 
ence  sequence.  FRAN024  showed  a  significantly  higher 
number  of  SNPs  (2915)  compared  to  other  type  B  strains. 
FRAN024  is  a  lapanese  holarctica  strain.  It  has  been 
reported  that  the  F.  tularensis  subsp.  holarctica  isolates 
from  Japan  are  unique,  being  somewhat  intermediate  to 
F.  tularensis  subsp.  tularensis  and  the  other  F.  tularensis 
subsp.  holarctica  isolates  [20,21],  Al  strains  showed  the 
highest  number  of  SNPs  when  compared  to  the  reference 
sequence  with  a  range  of  5929  to  6543  whereas  A2  strains 
displayed  a  range  of  4732  to  5469  SNPs.  The  average 
number  of  SNPs  for  Al  strains  was  6362  +  161  and  5096 
+  281  for  A2  strains. 

Whole  genome  phylogenetic  clustering  of  strains  and  SNP 
analysis 

The  cladogram  and  phylogram  generated  from  the  whole- 
genome  resequence  SNP  data  of  all  40  Francisella  strains  is 
shown  in  Figure  2.  Phylogenetic  analysis  revealed  distinct 


clustering  of  the  strains  into  the  two  subspecies,  type  A 
and  type  B,  with  further  separation  of  strains  within  clus¬ 
ters.  F.  novicida  (FRAN003)  was  distinct  from  type  A  and 
type  B  and  formed  its  own  phylogenetic  group.  Nodes 
(including  internal  nodes  and  leaf  nodes)  of  the  phyloge¬ 
netic  tree  were  assigned  numbers  by  the  SNPTree  pro¬ 
gram.  All  type  A  strains  emerged  from  node  4,  whereas  all 
type  B  strains  emerged  from  node  50.  The  type  A  strains 
were  divided  into  two  primary  sub-nodes,  node  39  and 
node  5,  corresponding  to  clades  A2  and  Al  respectively. 
Al  strains  further  subdivided  into  node  8,  node  23,  and 
node  5,  corresponding  to  clades  Ala  and  Alb  and  the 
MAOO-2987  strain,  respectively  (Table  1).  SCHU  S4,  the 
laboratory  type  A  strain,  fell  within  the  Ala  clade  (node 
8).  Type  B  strains  also  divided  into  two  clades  based  on 
nodes  52  and  64;  these  clades  are  referred  to  here  as  B1 
and  B2,  respectively.  The  Japanese  holarctica  isolate 
FRAN024  formed  its  own  phylogenetic  group.  Subsec¬ 
tions  of  the  phylogenetic  tree  at  higher  resolution,  repre¬ 
senting  the  type  Al  (excluding  MAOO-2987),  A2  and  B 
strains  (excluding  FRAN024)  are  shown  in  Figure  3. 

Within  type  A  nodes,  strains  originating  from  distinct  geo¬ 
graphic  locations  (WY96  3418,  CA02  0099,  UT02  1927, 
KSOO  1817,  MAOO  2987,  AROl  1117,  OKOO  2732)  with  no 
known  link  to  one  another  were  clearly  resolved  by  whole 
genome  SNP  based  phylogenetic  clustering  (Figure  3, 
Table  1).  This  method  also  showed  high  potential  for  dif¬ 
ferentiating  between  closely  related  F.  tularensis  strains. 
The  Ala  strains,  SCHU  S4,  FRAN023,  FRAN031, 
FRAN032,  FRAN026,  FRAN030,  and  FRAN033  all  origi¬ 
nate  from  the  same  temporal  location  (Ohio)  in  the 
1940's  (Figure  3,  Table  1).  FRAN031  and  FRAN032  could 
not  be  distinguished  on  the  basis  of  SNPs,  suggesting  they 
may  represent  the  same  strain.  Similarly,  the  Alb  strains, 
FRAN005,  FRAN006,  FRAN007,  FRAN008,  FRAN009, 
FRANOlO,  FRAN014,  and  FRAN015  all  derive  from  cot¬ 
tontail  rabbit  from  one  state  park  in  Illinois,  with  5  or 
fewer  SNP  differences  distinguishing  these  strains  (Figure 
3,  Table  1).  The  A2  strains,  FRANOOl,  FRAN027  and 
FRAN028,  were  considered  likely  derivatives  of  the  aviru- 
lent  strain  38  (Jellison);  SNP  based  phylogenetic  cluster¬ 
ing  confirms  this  assumption  (Figure  3,  Table  1). 

Within  type  B  nodes,  strains  from  Russia  and  North  Amer¬ 
ica  were  associated  with  node  64  (B2  strains),  whereas 
only  strains  derived  from  North  America  (B1  strains)  were 
associated  with  node  52  (Figure  3,  Table  1).  Overall,  all 
unique  type  B  strains  (FRAN029,  OR96  0246,  OR96 
0463,  FRAN025,  KY99  3387,  CA99  3992,  FRAN012,  JNOO 
2758,  KYOO  1708  and  MOOl  1673)  were  resolved  using 
whole  genome  SNP  analysis. 

Table  3  summarizes  the  SNP  content  for  each  of  the  major 
nodes  identified  in  our  phylogenetic  analysis  (Figure  2). 
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Figure  I 

Whole  genome  resequencing  and  SNP  profiles  of  F.  tularensis  strains.  (A)  Whole  genome  resequencing  call  rates  and 
(B)  single  nucleotide  polymorphic  profiles  of  39  F.  tularensis  type  A  and  B  strains.  The  data  is  an  average  of  sample  analysis  per¬ 
formed  in  duplicate.  The  filtered  base  call  rate  and  the  filtered  SNP  values  were  obtained  by  processing  the  raw  data  from 
Affymetrix  software  through  our  bioinformatic  filters  [13].  Strains  are  displayed  as  either  Al ,  A2  or  type  B  for  comparative 
analysis.  F.  tularensis  subsp.  novicida  (FRAN003)  displayed  an  average  filtered  base  call  rate  of  83.041%  and  12407  filtered  SNPs 
(data  not  shown). 
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Figure  2 

Whole  genome  SNP  based  phylogenetic  analysis  of  Francisella  strains.  Phylogenetic  analysis  of  resequenced  Fran- 
cisella  strains.  The  whole-genome  resequencing  data  was  pared  down  to  those  base  positions  at  which  a  SNP  call  occurred  in 
one  or  more  of  the  forty  strains.  These  sequences  were  used  to  generate  a  phylogenetic  tree  using  the  MrBayes  program  as 
described  in  methods.  This  tree  was  then  displayed  as  a  cladogram  (A)  and  as  a  phylogram  (B)  using  the  TreeView  program 
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html.  Distinct  clustering  of  type  A  and  type  B  strains  was  observed.  Both  type 
A  and  B  strains  were  further  discriminated  within  the  clusters.  In  the  cladogram,  the  percentage  values  on  the  branches  are  the 
probabilities  of  the  partitions  indicated  by  each  branch.  The  numbers  shown  in  red  are  node  numbers  of  significant  nodes  that 
are  referenced  in  the  manuscript.  In  the  phylogram,  the  branch  lengths  are  proportional  to  the  mean  of  the  posterior  probabil¬ 
ity  density,  and  a  scale  bar  is  given  to  relate  the  branch  lengths  to  their  numeric  values. 


The  differentiating  SNPs  and  maximum  SNP  separation 
numbers  are  indicators  of  the  diversity  within  each  node, 
as  these  represent  SNP  differences  between  members  of 
the  node  (rather  than  SNP  differences  relative  to  the  refer¬ 
ence  genome).  The  differentiating  SNPs  are  the  number  of 
locations  at  which  two  or  more  member  strains  have  dif¬ 
fering  base  calls.  Maximum  SNP  separation  is  the  maxi¬ 
mum  number  of  SNP  differences  that  are  found  between 
any  two  members  of  the  node.  As  expected,  the  SNP  diver¬ 
sity  is  greatest  within  subspecies  (type  A  and  type  B)  and 


decreases  within  clades;  Bl,  Ala  and  Alb  strains  showed 
the  least  diversity  (maximum  SNP  separation  of  76,  75 
and  38,  respectively).  Typing  methods  have  previously 
revealed  less  diversity  within  type  B  than  type  A  strains 
[2,21-23].  Similarly,  our  data  show  less  diversity  among 
type  B  isolates,  with  a  maximum  SNP  separation  of  602 
when  the  Japanese  holarctica  strain  FRAN024  is  excluded 
from  this  analysis  (B*).  However,  when  all  type  B  isolates, 
including  the  Japanese  holarctica  strain  FRAN024,  are 
included  in  the  analysis,  our  data  indicates  a  similar  level 
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Figure  3 

Expanded  phylogram  for  F.  tularensis  Al,  A2  and  type  B  strains.  Expanded  sections  of  the  phylogram  (Figure  2B)  con¬ 
taining  the  F.  tularensis  A I  strains  except  MAOO  2987  (A),  A2  strains  (B)  and  type  B  strains  except  FRAN024  (C).  The  three 
subtrees  are  shown  at  different  scales.  The  scale  bars  below  each  subtree  are  given  to  relate  the  branch  lengths  to  their 
numeric  probability  values. 


of  diversity  for  types  A  and  B  (maximum  SNP  separation 
of  2779  and  2833,  respectively). 

The  presence  of  a  large  number  of  differentiating  SNPs 
within  each  phylogenetic  node  suggests  that  a  deeper  level 
of  discrimination  can  be  achieved  by  identifying  SNPs 
unique  to  individual  strains.  The  smallest  number  of  dif¬ 
ferentiating  SNPs  within  a  phylogenetic  node  was  7 1  (Alb 
strains).  The  phylogram  (Figure  2B)  indicates  that  the 
closest  clade  pairings  are  between  Ala/Alb  and  B1/B2 
which  is  quantitatively  in  agreement  with  the  SNP  differ¬ 
ences  as  shown  in  Additional  File  4.  Phylogenetic  analyses 
performed  by  two  independent  approaches  (Bayesian  in 
Figure  2  and  maximum  likelihood  in  Additional  File  1) 
showed  some  differences  only  at  the  level  of  minor  clades 
in  the  trees.  These  did  not  affect  the  subsequent  analyses. 

Typing  assays  based  on  high  quality  global  SNP  markers 

Node  pairings  that  discriminated  between  F,  tularensis 
subspecies  or  within  subspecies  were  selected  for  the 
development  of  SNP  diagnostic  typing  assays  (Figure  2). 
The  four  node  pairings  were  node  4  and  node  50,  node  52 
and  node  64,  node  39  and  node  5,  and  node  8  and  node 
23  for  discrimination  of  type  A  vs.  type  B,  B 1  vs.  B2,  A2  vs. 


Al  and  Ala  vs.  Alb,  respectively.  A  SNP  location  was 
selected  to  differentiate  between  two  nodes  in  the  tree 
when  all  strains  belonging  to  one  node  contain  the  SNP 
call  and  all  strains  belonging  to  the  other  node  contain  the 
reference  call  at  that  location.  The  location  of  the  32  in  sil¬ 
ica  identified  diagnostic  SNP  markers  in  the  F.  tularensis 
LVS  genome  are  shown  in  Figure  4.  Fourteen  SNP  loci 
were  in  the  forward  strand,  sixteen  in  the  reverse  and  two 
loci  were  in  non-coding  intergenic  regions.  The  discrimi¬ 
nating  nodes,  SNP  location,  locus  name,  gene  symbol 
with  product  and  the  role  category  is  described  in  the 
Additional  File  5. 

To  show  that  SNPs  can  be  used  as  diagnostic  markers  for 
typing  of  F,  tularensis  subspecies  and  clades,  RT-PCR 
assays  were  designed.  Initially,  seven  F,  tularensis  strains 
were  used  to  screenthe  32  RT-PCR  discriminatory  SNP 
positions  for  the  ability  to  distinguish  type  A  vs.  type  B,  Al 
vs.  A2,  Ala  vs.  Alb,  and  B1  vs.  B2.  Preliminary  results 
indicated  5  out  of  9  primer  sets  (684048,  917759, 
1014623,  1136971,  1581977)  distinguished  type  A  and 
type  B,  3  out  of  9  primer  sets  distinguished  Al  and  A2 
(521982,  1025460,  1507435),  2  out  of  5  primers  sets  dis¬ 
tinguished  Ala  and  Alb  (518892,  1574929) and  3  out  of 
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Table  3:  SNP  content  of  the  major  nodes  identified  in  the  phylogenetic  tree  (cladogram) 


Node 

Sub¬ 

species/ 

clade/sub- 

clade 

Numberof  Total  SNPs 
strains  per 
node 

Total  SNPs 
in  LVS 
genome 

Total  SNPs 
in  SchuS4 
unique 
sequence 

Common 

SNPs 

Unique 

SNPs 

Differentia 
ting  SNPs 

Maximum 

SNP 

separation 

50 

B 

13 

3771 

3686 

85 

5 

2837 

3656 

2833 

51 

B* 

12 

1  154 

1115 

39 

6 

233 

1060 

602 

52 

Bl 

7 

779 

750 

29 

385 

164 

161 

76 

64 

B2 

5 

705 

677 

28 

7 

153 

628 

549 

4 

A 

26 

8653 

8559 

94 

2905 

514 

3765 

2779 

39 

A2 

6 

6003 

5919 

84 

3789 

358 

316 

201 

5 

Al 

20 

7306 

7291 

15 

4953 

323 

497 

176 

8 

Ala 

9 

7001 

6993 

8 

5491 

277 

129 

75 

23 

Alb 

10 

7030 

7022 

8 

5537 

234 

71 

38 

*  contains  all  the  type  B  strains  with  the  exception  of  FRAN024,  Japanese  holarctica  strain. 

Total  SNPs  are  locations  at  which  a  SNP  occurs  in  one  or  more  strains  in  the  node  (if  the  same  SNP  occurs  in  more  than  one  strain,  that  location 
is  counted  only  once).  Common  SNPs  are  locations  where  all  strains  in  the  node  share  the  same  base  call,  which  is  different  from  the  reference  call 
on  the  resequencing  platform.  Unique  SNPs  are  locations  where  just  a  single  strain  in  the  node  has  a  base  call  that  differs  from  the  reference 
sequence.  Differentiating  SNPs  are  locations  at  which  at  least  two  strains  in  the  node  have  different  base  calls.  Maximum  SNP  separation  is  the 
number  of  base  calls  separating  the  two  most  distant  members  of  the  node.  Differentiating  SNPs  and  maximum  SNP  separation  are  both  indicators 
of  the  degree  of  diversity  within  the  node.  The  detection  of  diversity  is  limited  by  the  extent  to  which  our  sample  set  is  representative  of  the 
variability  within  each  clade  in  nature.  Refer  to  Figure  2  for  the  details  of  strain  clustering. 


9  primer  sets  distinguished  B1  and  B2  (299153,  470635, 
1011425).  The  two  primer  sets  from  each  group  display¬ 
ing  the  largest  difference  in  Ct  values  (shown  in  bold) 
were  pursued  further  (1014623,  1136971,  521982, 
1507435,  518892,  1574929,  299153  and  470635).  To 
determine  the  robustness  of  these  discriminatory  SNP 
positions,  an  additional  39  F.  tularensis  strains  (23  type  A, 
16  type  B)  (Table  2)  were  examined. 

The  data  for  4  primer  sets  (1014623,  521982,  299153  and 
1574929)  is  shown  in  Figure  5.  These  assays  are  hierarchi¬ 
cal  in  nature.  The  first  primer  set  determines  whether  a 
strain  is  type  A  or  type  B  based  on  SNP  1014623.  In  type 
A  and  type  B  strains,  this  nucleotide  position  is  T  and  C, 
respectively.  A  strain  identified  as  type  B  can  be  further 
typed  as  B1  or  B2  based  on  SNP  299153  (G  in  B1  strains 
and  T  in  B2  strains).  Similarly,  strains  identified  as  type  A 
can  be  classified  as  A1  or  A2  based  on  SNP  521982  (T  in 
A1  strains  and  C  in  A2  strains)  and  A1  strains  further  char¬ 
acterized  asAlaorAlbbySNP  1574929  (G  in  Ala  strains 
and  C  in  Alb  strains). 

As  shown  in  Figure  5,  the  type  A  and  type  B  SNP  assay 
clearly  distinguished  between  the  23  type  A  and  16  type  B 
strains.  The  23  type  A  strains  were  then  subdivided  into  1 5 


A1  and  8  A2  strains  and  the  15  A1  strains  were  subse¬ 
quently  further  sub-divided  into  8  Ala  and  7  Alb  strains. 
For  all  23  type  A  strains,  the  classification  of  strains  as  Al, 
A2,  Ala  or  Alb  by  diagnostic  SNP  typing  corresponds 
with  Pmel  PFGE  typing  results  (Table  2)  [14],  emphasiz¬ 
ing  the  power  and  the  utility  of  this  simpler  methodology 
for  classification  of  type  A  clades. 

Type  B  strains  were  also  resolved  into  B1  and  B2  clades 
based  on  a  single  SNP.  As  these  clades  were  newly  identi¬ 
fied  by  our  SNP  based  phylogenetic  clustering,  rese¬ 
quenced  B1  (KYOO  1708  and  MOOl-1673)  and  B2  (TVS, 
OR96  0246)  strains  were  included  as  positive  controls.  Of 
the  1 6  type  B  strains  tested,  nine  isolates  were  classified  as 
B2  and  7  isolates  were  classified  as  Bl.  Isolates  from  Rus¬ 
sia  (RC  503),  Spain  (SP03  1782  and  SP98  2108)  Finland 
(SP03  1783)  and  the  US  were  identified  as  B2  by  this 
assay,  whereas  isolates  from  Ganada  and  the  US  were 
identified  as  Bl,  providing  evidence  for  geographic  clus¬ 
tering  of  type  B  isolates  based  on  this  SNP  marker.  In  sum¬ 
mary,  this  work  shows  the  potential  for  development  of 
SNP  typing  markers  based  on  a  relatively  small  number  of 
"complete"  genome  sequences.  For  future  work,  it  will  be 
important  to  define  a  set  of  SNPs  that  could  be  used  for 
high-resolution  discrimination  to  the  strain  level. 
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Figure  4 

Location  of  in  silico  identified  diagnostic  SNP  markers  in  the  F.  tularensis  LVS  genome.  Representation  of  in  silico 
discriminating  SNP  markers  on  the  F.  tularensis  LVS  genome.  The  vertical  colored  bar  represents  the  position  of  the  SNP 
marker  on  the  LVS  with  the  relevant  node  pair  indicated  by  color.  Loci  containing  the  discriminatory  SNP  markers  in  the  for¬ 
ward  and  reverse  strands  are  shown  in  red  and  blue  respectively.  Two  markers  in  the  non-coding  sequences  of  the  genome  are 
also  shown. 


Discussion 

Whole  genome  comparative  analysis  and  collection  of 
high-confidence  global  SNPs  from  multiple  strains  of  a 
given  bacterial  species  has  a  number  of  applications  in 
both  basic  and  translational  research.  Our  study  was 
undertaken  with  an  objective  of  providing  the  scientific 
community  with  whole-genome  sequence  and  SNP  infor¬ 
mation  from  multiple  strains  of  F.  tularensis,  enabling 
rapid  advancements  in  our  understanding  of  basic  and 
applied  biology  of  this  organism.  F.  tularensis  has  been 
recognized  as  a  causative  agent  of  tularemia  for  almost  a 
century  [24]  and  is  classified  as  a  category  A  biodefense 
agent.  We  have  collected  nearly  complete  (~9 1%)  genome 
sequence  and  global  SNP  information  from  forty  Fran- 
cisella  strains  using  our  whole  genome  high-density  rese¬ 
quencing  array  platform  [13].  All  the  sequence  and  SNP 
information  is  publicly  available  to  the  scientific  commu¬ 
nity  from  Biodefense  and  Public  Health  Database  (Bio- 
HealthBase)  at  http://www.biohealthbase.org/GSearch/ 
home.do?decorator=Francisella.  BioHealthBase  is  a  Bio¬ 


informatics  Resource  Center  (BRC)  for  biodefense  and 
emerging/re-emerging  infectious  diseases  that  is  sup¬ 
ported  by  the  National  Institute  of  Allergy  and  Infectious 
Diseases  (NIAID).  The  data  can  also  be  obtained  from  our 
web  site  at  http://  pfgrc.  j  evi .  org/  index,  php/ 

compare  genomics/frandsella  genotyping.html  or 

through  the  ICVI  ftp  server  at  ftp://ftp.jcvi.org/pub/data/ 
PFGRC/Ft  DataRelease/.  This  multi-strain  high-quality 
nearly  complete  genome  sequence  and  global  SNP  infor¬ 
mation  provides  a  unique  opportunity  to  perform  com¬ 
parative  genome  analysis  between  F.  tularensis  strains, 
thus  contributing  towards  a  better  understanding  of  path¬ 
ogenicity  and  evolutionary  relationships  of  this  species. 
We  have  used  this  information  to  build  a  robust  whole 
genome  based  phylogeny  that  enabled  the  identification 
of  SNP  discriminatory  markers.  We  further  validated  high 
quality  global  SNP  markers  for  typing  of  F.  tularensis  sub¬ 
species  and  clades  as  a  proof  of  concept  that  these  markers 
may  be  used  for  future  development  of  high-resolution 
typing  methods. 
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strain 


Figure  5 

Real-time  PCR  evaluation  of  SNP  diagnostic  markers.  Evaluation  of  SNP  diagnostic  markers  using  real-time  PCR.  Data 
is  shown  for  primer  sets  A)  1 0 1 4623  discriminating  node  pairings  4  and  50  (type  A  vs.  type  B);  B)  52 1 982  discriminating  node 
pairings  5  and  39  (A  I  vs.  A2);  C)  299153  discriminating  node  pairings  52  and  64  (Bl  vs.  B2);  and  D)  1574929  discriminating 
node  pairings  8  and  23  (Ala  vs.  Alb).  The  six  control  strains  included  in  the  analysis  are  also  shown;  A I  (AROI  1117),  A2 
(WY96  3418),  Bl  (LVS,  OR96  0246)  and  B2  (KYOO  1708,  MOO  I  1673). 


Previous  reports  have  suggested  that  greater  genetic  diver¬ 
sity  exists  among  type  A  as  compared  to  type  B  strains  [2] . 
Our  whole  genome  SNP  based  analysis  of  12  type  B  iso¬ 
lates  from  North  America  and  Russia  appears  to  confirm 
this  observation.  However,  SNP  data  obtained  after  inclu¬ 
sion  of  a  lapanese  type  B  strain  (FRAN024)  indicated  a 
similar  level  of  SNP  diversity  in  type  A  and  type  B  strains 
(Table  3).  Sufficient  SNP  diversity  was  observed  among 
type  B  strains  to  generate  an  internal  structure  in  the  phy¬ 
logenetic  tree  (Figure  2)  as  well  as  to  resolve  all  unique 
strains.  The  single  F.  novicida  isolate  in  our  study, 


FRAN003  (HI  12),  had  the  lowest  base  call  rate 
(83.041%)  and  the  highest  number  of  SNPs  (12,407) 
among  our  samples.  The  low  base  call  rate  is  a  likely 
reflection  of  the  sequence  divergence  between  the  F.  novi¬ 
cida  strain  (U1 1 2)  and  the  reference  sequence  on  our  rese¬ 
quencing  chips.  Rohmer  et.  al[ll].  have  reported  a 
nucleotide  sequence  identity  of  97.8%  between  the  LVS 
and  F.  novicida  11112  genomes.  The  differences  in  these 
two  approaches  may  be  due  to  the  fact  that  array-based 
resequencing  is  sensitive  to  sequence  divergence,  and  per¬ 
forms  best  with  samples  that  are  homologous  with  the  ref- 


Page  14  of  17 

(page  number  not  for  citation  purposes) 


BMC  Microbiology  2009,  9:213 


Approved  for  public  release.  Distribution  is  unlimited 

http://www.biomedcentral.eom/1471-2180/9/213 


erence  sequence.  In  our  global  SNP  phylogenetic  analysis, 
F.  novicida  (U112)  is  well  separated  from  the  F.  tularensis 
isolates  (Figure  2B). 

A  number  of  molecular  approaches  have  been  used  to  bet¬ 
ter  understand  the  diversity  of  Francisella  [2,21,25-27], 
New  subdivisions  within  F.  tularensis  subspecies  have 
been  revealed  by  these  approaches.  Differing  methods 
provide  differing  resolution  as  most  of  the  methods  sam¬ 
ple  only  a  subset  of  the  whole  genome  in  order  to  assess 
relationships  among  different  isolates  [2],  MLVA  is  con¬ 
sidered  to  provide  the  highest  discriminatory  power  (i.e. 
strain  level)  [2,21,28].  PFGE  typing  has  been  used  to  iden¬ 
tify  four  distinct  type  A  genotypes,  Ala,  Alb,  A2a  and  A2b 
[9],  not  previously  observed  by  MLVA  typing.  PFGE  typing 
combined  with  epidemiologic  data  revealed  that  the 
observed  genetic  diversity  among  type  A  strains  correlated 
with  differences  in  clinical  outcome  and  geographic  distri¬ 
bution.  Alb  strains  were  associated  with  significantly 
higher  mortality  in  humans  as  compared  to  Ala,  A2  or 
type  B  strains.  Type  B  strains  display  little  or  no  genetic 
diversity  by  PEGE  [14]  and  a  number  of  other  molecular 
methods  [2,10,21-23]. 

Gomparative  whole-genome  sequence  analysis  provides 
the  highest  level  of  discrimination  among  different 
strains,  but  has  not  been  widely  used  due  to  the  high  cost 
of  this  method.  Keim  et  al  [2]  have  shown  a  whole- 
genome  SNP  phylogeny  of  Francisella  using  ~8000  syn- 
tenic  SNPs  from  the  published  whole  genome  sequences 
of  seven  strains.  Use  of  only  two  type  A  and  two  type  B 
genomes  was  sufficient  to  reveal  that  type  A  strains  differ 
greatly  from  each  other  unlike  type  B  strains.  More 
recently,  the  phylogenetic  structure  of  F.  tularensis  has 
been  reported  based  on  whole  genome  SNP  analysis  of 
thirteen  publicly  available  genome  sequences;  29,774 
SNPs  were  used  in  this  analysis  [4],  In  this  study,  we  have 
constructed  a  phylogenetic  profile  of  forty  Francisella 
strains  based  on  whole  genome  sequences.  This  to  our 
knowledge  is  the  first  report  of  a  phylogenetic  model 
based  on  nearly  complete  genomes  of  multiple  strains  of 
F.  tularensis  using  Affymetrix  resequencing  arrays. 

We  have  demonstrated  that  resequencing  data  may  be 
used  to  generate  high-resolution  phylogenetic  trees  based 
on  global  SNPs.  The  advantage  of  this  sequence-based 
approach  is  that  SNP  based  phylogenetic  trees  can  be  used 
for  evolutionary  analyses.  The  comparative  analysis  based 
on  the  phylogenetic  relatedness  of  strains  can  provide  sig¬ 
nificant  insights  into  the  varying  degree  of  phenotypes 
and  ecotypes  of  an  organism.  The  total  number  of  com¬ 
plete  genomes  required  to  achieve  an  optimum  phyloge¬ 
netic  profile  from  the  multiple  strains  of  an  organism  will 
be  determined  by  the  degree  of  plasticity  of  the  genome. 
Adequate  phylogenetic  relationship  can  be  determined 


with  a  sufficient  number  of  genomes  from  diverse  isolates 
of  an  organism  and  the  whole  genome  comparative  anal¬ 
ysis  of  such  related  strains  can  provide  real  biological 
insights  into  the  adaptation  and  evolution  of  a  species. 
Such  phylogenetic-based  comparative  analysis  can  cap¬ 
ture  genomic  differences  of  very  closely  related  strains  and 
provide  valuable  information  for  the  development  of 
rapid  molecular  sequence  based  assays,  capable  of  dis¬ 
crimination  to  the  strain  level. 

Conclusion 

The  whole  genome  resequencing  array  platform  provides 
sequence  and  SNP  information  from  multiple  strains  for 
any  infectious  agent  with  an  available  whole  genome 
sequence.  Multi-strain  whole  genome  sequence  data 
allows  one  to  build  robust  phylogenetic  models  for  an 
organism  based  on  global  SNPs.  Whole  genome  SNP 
based  phylogenetic  trees  can  guide  meaningful  compara¬ 
tive  analysis  of  strains  to  better  understand  the  biology  of 
an  organism  as  well  as  in  translational  research  such  as  in 
developing  high  resolution  economical  SNP  based  typing 
assays.  We  have  collected  whole  genome  sequence  and 
SNP  information  from  forty  strains  of  Francisella  to  con¬ 
struct  a  global  phylogeny.  Our  data  shows  a  good  correla¬ 
tion  with  the  previously  published  reports  using  limited 
genomic  sequence  information  and  also  provides  higher 
strain  resolution.  We  used  the  whole  genome  SNP  phylog¬ 
eny  to  identily  informative  SNP  markers  specific  to  major 
nodes  in  the  tree  and  to  develop  a  genotyping  assay  for 
subspecies  and  clades  of  F.  tularensis  strains.  Less  diverse 
type  B  strains  could  even  be  discriminated  into  two  clades, 
B1  and  B2,  based  on  a  single  SNP.  Our  whole  genome 
SNP  based  phylogenetic  clustering  shows  high  potential 
for  identifying  SNP  markers  within  F.  tularensis  capable  of 
discriminating  to  the  strain  level.  This  finding  should 
greatly  facilitate  the  rapid  and  low-cost  typing  of  F.  tula¬ 
rensis  strains  in  the  future. 
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